Toolbox Programs
Let's get building!

About

This is a project for one of our classes in the second semester of our Master's program in Natural Language Processing.



Objective

Automate the extraction and examination of lexical patterns in a large corpus.

A series of tools

Write a series of scripts in Perl that can be generalized and modified to work in a variety of contexts for similar goals.

Results

Display results as a collection of graphs side by side with our interpretations.




How our tools work together:


  • THE DATA

    The raw data

    Le Monde

    Our raw data comes from the 2014 RSS news feed of the French newspaper Le Monde. The files are in XML format and organized in folders by month.


  • 1

    Tool 1

    Extracting headlines and descriptions

    Our first tool walks the folder structure, reads each file, and extracts the content of the <title> and <description> tags. This data is then cleaned: escaped characters are replaced, and images and other unprocessable data are removed.


  • 2

    Tool 2

    Part-of-speech tagging

    This tool takes the output from Tool 1 and passes it to two different part-of-speech taggers: TreeTagger and Cordial.


  • 3

    Tool 3

    Morphosyntactic phrase searching

    Tool 3 is composed of two scripts, one for the TreeTagger output and one for the Cordial output. Each searches its input for specified morphosyntactic patterns (e.g., noun-preposition-noun).


  • 4

    Tool 4

    Graphing the results

    Our last tool uses the extracted patterns to create visual representations of how these phrases appear in the text.




  • Analysis

Tool 1

Read the files within their folder structure and extract the content of the <title> and <description> tags. Clean the result by replacing escaped characters and removing images and other unprocessable data.
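
The extraction and cleaning steps can be sketched as follows. This is an illustrative Python sketch, not the project's Perl scripts. The sample feed, the function names `clean` and `extract_items`, and the choice to read only the <title> and <description> of each <item> are assumptions made for the example.

```python
import html
import re
import xml.etree.ElementTree as ET

# Hypothetical miniature feed; the real files come from Le Monde's RSS feed.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Feed title</title>
    <description>Feed description</description>
    <item>
      <title>L&amp;#39;exemple &amp;amp; le test</title>
      <description>Un résumé &lt;img src="x.jpg"/&gt;avec une image.</description>
    </item>
  </channel>
</rss>"""


def clean(text):
    """Replace escaped characters, drop embedded markup, normalize spaces."""
    text = html.unescape(text or "")        # e.g. &#39; becomes '
    text = re.sub(r"<[^>]+>", " ", text)    # remove <img .../> and other tags
    return re.sub(r"\s+", " ", text).strip()


def extract_items(rss_xml):
    """Return a list of (title, description) pairs, one per <item>."""
    root = ET.fromstring(rss_xml)
    pairs = []
    for item in root.iter("item"):
        title = clean(item.findtext("title"))
        description = clean(item.findtext("description"))
        pairs.append((title, description))
    return pairs


if __name__ == "__main__":
    for title, description in extract_items(SAMPLE_RSS):
        print(title)
        print(description)
```

Reading a whole year of files would add a loop over the monthly folders around `extract_items`; that part is omitted here because it depends on the exact file layout.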



Overview

What does Tool 1 do?

In-class versions

Perl

Other versions

Perl with XPath

Our version 1

Pure Perl

Our version 2

Perl Modules

Results

A sample

Tool 2

Take the output from Tool 1 and pass it to two different part-of-speech taggers: TreeTagger and Cordial.



Overview

What does Tool 2 do?

In-class version

The professor's method

Le Trameur

Tool 2 & Tool 3 & Tool 4

Our version 1

Pure Perl

Our version 2

Perl modules

Results

A sample

Tool 3

Search within the outputs of TreeTagger and Cordial for specified morphosyntactic patterns (e.g., noun-preposition-noun).
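
A pattern search of this kind can be sketched as a sliding window over the tagged tokens. The Python sketch below is an illustration only, not the project's Perl scripts. It assumes TreeTagger-style input with one token per line and tab-separated token, tag, and lemma columns; the tag names in the sample (`NOM`, `PRP`, `DET:ART`, `VER:pres`) and the function names are assumptions for the example.

```python
# Hypothetical tagged sample: token <TAB> tag <TAB> lemma, one token per line.
SAMPLE_TAGGED = (
    "la\tDET:ART\tle\n"
    "hausse\tNOM\thausse\n"
    "de\tPRP\tde\n"
    "prix\tNOM\tprix\n"
    "continue\tVER:pres\tcontinuer\n"
)


def parse_tagged(tagged_text):
    """Turn tagger output into a list of (token, tag, lemma) tuples."""
    tokens = []
    for line in tagged_text.splitlines():
        parts = line.split("\t")
        if len(parts) == 3:  # skip blank or malformed lines
            tokens.append(tuple(parts))
    return tokens


def find_pattern(tokens, pattern):
    """Return every token sequence whose tags match the pattern in order.

    A tag matches when it starts with the pattern element, so "PRP"
    also accepts subtypes such as "PRP:det".
    """
    size = len(pattern)
    matches = []
    for start in range(len(tokens) - size + 1):
        window = tokens[start:start + size]
        if all(tok[1].startswith(tag) for tok, tag in zip(window, pattern)):
            matches.append(" ".join(tok[0] for tok in window))
    return matches


if __name__ == "__main__":
    tokens = parse_tagged(SAMPLE_TAGGED)
    print(find_pattern(tokens, ["NOM", "PRP", "NOM"]))
```

Cordial uses a different output layout and tagset, which is one reason a separate script per tagger is a natural design: only the parsing step and the tag names need to change.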



Overview

What does Tool 3 do?

In-class versions

Using Perl

Other versions

Using XPath

Our versions

Modification of the scripts

Patterns

Patterns used

Results

Extracted phrases

Tool 4

Use the extracted patterns to create visual representations of how these phrases appear in the text.
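
Before anything can be plotted, the extracted phrases have to be turned into numbers. The Python sketch below shows one possible preparation step, counting how often each phrase occurs and rendering a rough text bar chart. It is an assumption about what a graphing input could look like, not a description of how the project's graphs are built; the sample phrases and function names are hypothetical.

```python
from collections import Counter


def top_phrases(phrases, n=3):
    """Count phrases case-insensitively and return the n most frequent."""
    counts = Counter(phrase.lower() for phrase in phrases)
    return counts.most_common(n)


def text_bars(ranked):
    """Render (phrase, count) pairs as simple text bars."""
    return [f"{phrase:<20} {'#' * count} ({count})" for phrase, count in ranked]


if __name__ == "__main__":
    # Hypothetical extracted phrases, for illustration only.
    sample = ["Coupe du monde", "coupe du monde", "prix du pétrole"]
    for line in text_bars(top_phrases(sample)):
        print(line)
```

The `(phrase, count)` pairs returned by `top_phrases` could be handed to any charting library in place of the text rendering.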



Overview

How we create these graphs

Graph 1

noun-adjective

Graph 2

noun-prep-det-noun

Graph 3

PCTFORTE ":"

Graph 4

CONJUNCTION

Analysis

What can we conclude?

The Team



Alexandre Cavalcante

Portuguese, French

Genevieve Bienvenue

English, French

Virginie Poadey

French, Japanese

We are first-year Master's students in the Natural Language Processing program at the Institut National des Langues et Civilisations Orientales (INALCO) in Paris, France. You can find a detailed description of our program (in French) here.